Gapminder and tidyverse:library(gapminder)
library(tidyverse)
Gapminder data was loaded properly:head(gapminder) %>%
knitr::kable()
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
We first check some general information about the Gapminder dataset before manipulating it.
gapminder %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
It is easy to see that the variables country and continent are both factors with 142 and 5 levels respectively.
arrange() on numeric variablesWe first explore the effects of the arrange() functions. In the following code, we attempt to sort the data by the mean life expectancy in 2007 for each continent:
gap_meanLifeExp_1 <- gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(meanLifeExp = mean(lifeExp)) %>%
arrange(meanLifeExp)
knitr::kable(gap_meanLifeExp_1)
| continent | meanLifeExp |
|---|---|
| Africa | 54.80604 |
| Asia | 70.72848 |
| Americas | 73.60812 |
| Europe | 77.64860 |
| Oceania | 80.71950 |
From the resulting table, it seems that the arrange() function does have an effect on the output data frame, as the order of the continents is now following the increasing order of the mean life expectancy.
However, if we use the resulting data frame for a figure, it is obvious that the arragement of the bars follows the alphabetical order of the names of the continents, rather than the order of the mean life expectancy:
gap_meanLifeExp_1 %>%
ggplot(aes(x = continent, y = meanLifeExp)) +
geom_bar(stat = "identity")
arrange() on factor variablesInstead of sorting the data using a numeric variable, we now attempt to sort the data using a categorical variable (i.e., a factor). To demonstrate, let us sort the data by continents in the reverse alphabetical order. This can be done by using arrange() in combination with desc(), as shown in the following code:
gap_meanLifeExp_2 <- gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarize(meanLifeExp = mean(lifeExp)) %>%
arrange(desc(continent))
knitr::kable(gap_meanLifeExp_2)
| continent | meanLifeExp |
|---|---|
| Oceania | 80.71950 |
| Europe | 77.64860 |
| Asia | 70.72848 |
| Americas | 73.60812 |
| Africa | 54.80604 |
It appears that the data frame now is ordered as intented, but when we feed this data frame into a figure, the intented order is not reflected in the resulting figure:
gap_meanLifeExp_2 %>%
ggplot(aes(x = continent, y = meanLifeExp)) +
geom_bar(stat = "identity")
From the Gapminder, we now drop the observations associated with Oceania in variable continent. Using the str() function, we can see that even after dropping these observations, the levels in the factor varialbes continent and country remain the same as in the original data set (5 and 142 levels respectively):
gap_no_oceania <- gapminder %>%
filter(continent != "Oceania")
str(gap_no_oceania)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
However, the total number of rows in the data frame is now different (1680 vs. 1704).
Now we explore the effects of two similar functions: droplevels() and fct_drop(). With droplevels(), the function goes through each factor variables and eliminate the levels that do not appears in any observations. In this case, continent now only has 4 levels (i.e., Oceania is now gone), and country now has 140 levels (i.e., Australia and New Zealand are dropped).
gap_no_oceania %>%
droplevels() %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
With fct_drop(), however, we can specify the variable whose levels we want to drop. In the following example, we drop the used level in continent. Notice that continent now has only 4 levels, but country still has 142 levels.
gap_no_oceania %>%
mutate(continent = fct_drop(continent)) %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
We have seen that, if we want to present data in a tabular format, we can using arrange() to do this. However, if we want the same order to appear in a plot, we need to use fct_reorder() or fct_reorder2(). To illustrate the functions, we make a plot that shows the growth in life expectancy from 1952 to 2007 in Asian countries, with the order of the countries based on the amount of growth.
First, we create a data frame that contains a variable growth showing the difference in life expectancy between 2007 and 1952. Furthermore, we arrange the order of country with reference to growth using fct_reorder.
gap_growth <- gapminder %>%
filter(year %in% c(1952, 2007), continent == "Asia") %>%
select(country, year, lifeExp) %>%
spread(key = year, value = lifeExp) %>%
mutate(growth = `2007` - `1952`, country = fct_reorder(country, growth))
knitr::kable(gap_growth)
| country | 1952 | 2007 | growth |
|---|---|---|---|
| Afghanistan | 28.801 | 43.828 | 15.027 |
| Bahrain | 50.939 | 75.635 | 24.696 |
| Bangladesh | 37.484 | 64.062 | 26.578 |
| Cambodia | 39.417 | 59.723 | 20.306 |
| China | 44.000 | 72.961 | 28.961 |
| Hong Kong, China | 60.960 | 82.208 | 21.248 |
| India | 37.373 | 64.698 | 27.325 |
| Indonesia | 37.468 | 70.650 | 33.182 |
| Iran | 44.869 | 70.964 | 26.095 |
| Iraq | 45.320 | 59.545 | 14.225 |
| Israel | 65.390 | 80.745 | 15.355 |
| Japan | 63.030 | 82.603 | 19.573 |
| Jordan | 43.158 | 72.535 | 29.377 |
| Korea, Dem. Rep. | 50.056 | 67.297 | 17.241 |
| Korea, Rep. | 47.453 | 78.623 | 31.170 |
| Kuwait | 55.565 | 77.588 | 22.023 |
| Lebanon | 55.928 | 71.993 | 16.065 |
| Malaysia | 48.463 | 74.241 | 25.778 |
| Mongolia | 42.244 | 66.803 | 24.559 |
| Myanmar | 36.319 | 62.069 | 25.750 |
| Nepal | 36.157 | 63.785 | 27.628 |
| Oman | 37.578 | 75.640 | 38.062 |
| Pakistan | 43.436 | 65.483 | 22.047 |
| Philippines | 47.752 | 71.688 | 23.936 |
| Saudi Arabia | 39.875 | 72.777 | 32.902 |
| Singapore | 60.396 | 79.972 | 19.576 |
| Sri Lanka | 57.593 | 72.396 | 14.803 |
| Syria | 45.883 | 74.143 | 28.260 |
| Taiwan | 58.500 | 78.400 | 19.900 |
| Thailand | 50.848 | 70.616 | 19.768 |
| Vietnam | 40.412 | 74.249 | 33.837 |
| West Bank and Gaza | 43.160 | 73.422 | 30.262 |
| Yemen, Rep. | 32.548 | 62.698 | 30.150 |
Notice that this time the data frame appears to be arranged differently than expected: it is still ordered by the variable country rather than by growth. However, when we use the same data frame to generate a plot, the bars are arranged as intended:
gap_growth %>%
ggplot(aes(y = growth, x = country)) +
theme_bw() +
geom_col() +
geom_point(aes(y = `2007`, x = country, color = "Life exp. in 2007")) + # We need to add points manually because of the data frame structure
geom_point(aes(y = `1952`, x = country, color = "Life exp. in 1952")) +
labs(title = "Growth in life expectancy in Asian countries", x = "Country", y = "Age (year)") +
scale_color_manual(name = "", values = c("Life exp. in 2007" = "#00BFC4", "Life exp. in 1952" = "#F8766D")) +
coord_flip() +
theme(legend.position = "bottom")
From the plot, we can immediately see that Oman is the country that has the most significant growth in life expectancy, while Iraq saw the smallest growth among Asian countries.
Here I experiment with the write_tsc() and read_tsv() functions with the gap_growth data frame created above. Particular attention is given to the variable country.
We first check the type of country, which is factor. We then save the data frame as a tsv file (gap_growth.txt). Upon importing the data frame from the tsv file again, let us investigate the type of country, to see if there is any change:
class(gap_growth$country) # "factor"
# Save the data frame
gap_growth %>%
write_tsv(path = "./data/gap_growth.txt")
# Read in the data frame
gap_growth_fct_char <- read_tsv(file = "./data/gap_growth.txt", col_names = TRUE)
class(gap_growth_fct_char$country) # "character"
We see that upon read-back, the type changed from factor to character.
Next, we change the values in country from alphabetical factor levels to non-alphabetical (i.e., numeric) factor levels. Similarly, we save this new data frame and read in it again to see if the type of the country variable has changed:
# Change the levels from alphabetical to numeric
gap_growth_fct_num <- gap_growth %>%
mutate(country = as.factor(as.numeric(country)))
class(gap_growth_fct_num$country) # "factor"
# Save the data frame
gap_growth_fct_num %>%
write_tsv(path = "./data/gap_growth_num.txt")
# Read in the data frame
gap_growth_num <- read_tsv("./data/gap_growth_num.txt", col_names = TRUE)
class(gap_growth_num$country) # "integer"
Once again, the type has changed from factor to integer. We will need to use as.factor() to switch the type back to factor.
For this part, I remade one of the plots from Homework 04 that shows the life expectancy growth in different continents. The vanilla plot output by ggplot2 with defaults values is as follows. Also, local regression LOESS (locally estimated scatterplot smoothing) is used for plotting the lines.
gapminder %>%
group_by(country) %>%
mutate(growth_lifeExp = lifeExp - lag(lifeExp)) %>% # Calculate the growth compared to the previous year
filter(year != 1952, continent != "Oceania") %>%
ggplot(aes(x = year, y = growth_lifeExp, color = continent)) +
facet_wrap(~continent) +
geom_jitter() +
geom_smooth(method = "loess") # Fit regression lines
Then a few tweaks were added to the plot:
Solarized from ggthemes was used instead (Solarized is a sixteen color palette designed for use with terminal and gui applications).alpha in geom_jitter to a lower value, so that the points would not appear clustering all together.x axis to match the years in the Gapminder dataset.y axis to make the declining trend more visible. This of course implies that some data points are not shown in the plot.Gapminder.#install.packages("ggthemes")
library(ggthemes)
gapminder %>%
group_by(country) %>%
mutate(growth_lifeExp = lifeExp - lag(lifeExp)) %>%
filter(year != 1952, continent != "Oceania") %>%
ggplot(aes(x = year, y = growth_lifeExp, color = continent)) +
facet_wrap(~continent) +
geom_jitter(alpha = 0.1) +
geom_smooth(method = "loess") +
scale_x_continuous(breaks = seq(1957, 2007, 10)) +
coord_cartesian(ylim=c(-2, 4)) +
labs(x = 'Year', y = 'Life expectancy growth (years)', title = 'Growth in life expectancy') +
theme_solarized() +
scale_color_manual(values = continent_colors) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "none"
)
plotlyWe can convert the plot above to a plotly graph. With a plotly graph, the user can interact with the graph by hovering over a point to see its values, zooming in and out, and panning around, among others.
library(plotly)
p <- gapminder %>%
group_by(country) %>%
mutate(growth_lifeExp = lifeExp - lag(lifeExp)) %>%
filter(year != 1952, continent != "Oceania") %>%
ggplot(aes(x = year, y = growth_lifeExp, color = continent)) +
facet_wrap(~continent) +
geom_jitter(alpha = 0.1) +
geom_smooth(method = "loess") +
scale_x_continuous(breaks = seq(1957, 2007, 10)) +
coord_cartesian(ylim=c(-2, 4)) +
labs(x = 'Year', y = 'Life expectancy growth (years)', title = 'Growth in life expectancy') +
theme_solarized() +
scale_color_manual(values = continent_colors) +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")
ggplotly(p)
The plot generated by ggplot2 can be saved using ggsave() with desired arguments. Notice that if we do not supply the plot = argument, the function will save the last plot generated by ggplot2. Also, the size of text will not scale with specified height and width, so it might take some experiments to find the best combination of these parameters. Mutilple image formats are supported by ggsave(), including .svg, which is a vector format, and .png, which is a raster format. The following code shows how to save the image with these two formats respectively:
library(svglite) # So we can save .svg images
# units specified the unit used in height and width
ggsave(filename = "./plots/growth.svg", plot = p, units = "cm", height = 10, width = 15)
q <- p + theme_economist() # Change the theme of the plot
ggsave(filename = "./plots/growth.png", units = "cm", height = 10, width = 15)
The saved images can be imbedded into Markdown with the following syntax:

